Abstract

In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96).



1 Introduction 

Generative models of syntax have been central in linguistics since they were introduced in (Chomsky 57). Each sentence-tree pair (S, T) in a language has an associated top-down derivation consisting of a sequence of rule applications of a grammar. These models can be extended to be statistical by defining probability distributions at points of non-determinism in the derivations, thereby assigning a probability $\mathcal{P}(S, T)$ to each (S, T) pair. Probabilistic context free grammar (Booth and Thompson 73) was an early example of a statistical grammar. A PCFG can be lexicalised by associating a head-word with each non-terminal in a parse tree; thus far, (Magerman 95; Jelinek et al. 94) and (Collins 96), which both make heavy use of lexical information, have reported the best statistical parsing performance on Wall Street Journal text. Neither of these models is generative; instead they both estimate $\mathcal{P}(T \mid S)$ directly.

This paper proposes three new parsing models. Model 1 is essentially a generative version of the model described in (Collins 96). In Model 2, we extend the parser to make the complement/adjunct distinction by adding probabilities over subcategorisation frames for head-words. In Model 3 we give a probabilistic treatment of wh-movement, which is derived from the analysis given in Generalized Phrase Structure Grammar (Gazdar et al. 95). The work makes two advances over previous models. First, Model 1 performs significantly better than (Collins 96), and Models 2 and 3 give further improvements: our final results are 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96). Second, the parsers in (Collins 96) and (Magerman 95; Jelinek et al. 94) produce trees without information about wh-movement or subcategorisation. Most NLP applications will need this information to extract predicate-argument structure from parse trees.

In the remainder of this paper we describe the 3 models in section 2, discuss practical issues in section 3, give results in section 4, and give conclusions in section 5.

2 The Three Parsing Models 
2.1 Model 1 

In general, a statistical parsing model defines the conditional probability, $\mathcal{P}(T \mid S)$, for each candidate parse tree T for a sentence S. The parser itself is an algorithm which searches for the tree, $T_{best}$, that maximises $\mathcal{P}(T \mid S)$. A generative model uses the observation that maximising $\mathcal{P}(T, S)$ is equivalent to maximising $\mathcal{P}(T \mid S)$:^1

$$T_{best} = \arg\max_T \mathcal{P}(T \mid S) = \arg\max_T \frac{\mathcal{P}(T, S)}{\mathcal{P}(S)} = \arg\max_T \mathcal{P}(T, S) \qquad (1)$$

$\mathcal{P}(T, S)$ is then estimated by attaching probabilities to a top-down derivation of the tree. In a PCFG, for a tree derived by n applications of context-free re-write rules $LHS_i \Rightarrow RHS_i$, $1 \le i \le n$,

$$\mathcal{P}(T, S) = \prod_{i=1}^{n} \mathcal{P}(RHS_i \mid LHS_i) \qquad (2)$$

The re-write rules are either internal to the tree, where LHS is a non-terminal and RHS is a string of one or more non-terminals; or lexical, where LHS is a part of speech tag and RHS is a word.

1 $\mathcal{P}(S)$ is constant, hence maximising $\mathcal{P}(T, S) / \mathcal{P}(S)$ is equivalent to maximising $\mathcal{P}(T, S)$.

Figure 1: A lexicalised parse tree, and a list of the rules it contains. For brevity we omit the POS tag associated with each word.
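As an illustration of equation (2), the following minimal Python sketch multiplies the rule probabilities found in a small unlexicalised tree. The grammar, its probabilities, the nested-tuple tree encoding and the function names are all assumptions made for this example; they are not taken from the paper.

```python
from math import prod

# Hypothetical toy grammar: P(RHS | LHS), keyed by (LHS, RHS). Values are made up.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("NNP",)): 0.6,
    ("NP", ("NN",)): 0.4,
    ("VP", ("VBD", "NP")): 1.0,
    ("NNP", ("Marks",)): 0.5,
    ("NNP", ("Brooks",)): 0.5,
    ("VBD", ("bought",)): 1.0,
}


def rules(tree):
    """Yield (LHS, RHS) for every rule application in a nested-tuple tree."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):
        # Lexical rule: a part of speech tag rewrites to a word.
        yield (label, (children[0],))
        return
    # Internal rule: a non-terminal rewrites to a string of non-terminals.
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)


def tree_probability(tree):
    """Equation (2): the product of P(RHS_i | LHS_i) over all rules in the tree."""
    return prod(RULE_PROB[rule] for rule in rules(tree))


tree = ("S",
        ("NP", ("NNP", "Marks")),
        ("VP", ("VBD", "bought"), ("NP", ("NNP", "Brooks"))))
print(tree_probability(tree))
```

The tree contains seven rule applications (four internal, three lexical), so the result is a product of seven table entries.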

A PCFG can be lexicalised^2 by associating a word w and a part-of-speech (POS) tag t with each non-terminal X in the tree. Thus we write a non-terminal as X(x), where x = (w, t), and X is a constituent label. Each rule now has the form^3

$$P(h) \rightarrow L_n(l_n) \ldots L_1(l_1)\, H(h)\, R_1(r_1) \ldots R_m(r_m) \qquad (3)$$

H is the head-child of the phrase, which inherits the head-word h from its parent P. $L_1 \ldots L_n$ and $R_1 \ldots R_m$ are left and right modifiers of H. Either n or m may be zero, and n = m = 0 for unary rules. Figure 1 shows a tree which will be used as an example throughout this paper.

The addition of lexical heads leads to an enormous number of potential rules, making direct estimation of $\mathcal{P}(RHS \mid LHS)$ infeasible because of sparse data problems. We decompose the generation of the RHS of a rule such as (3), given the LHS, into three steps: first generating the head, then making the independence assumptions that the left and right modifiers are generated by separate 0th-order Markov processes:^4

1. Generate the head constituent label of the phrase, with probability $\mathcal{P}_h(H \mid P, h)$.

2. Generate modifiers to the right of the head with probability $\prod_{i=1}^{m+1} \mathcal{P}_r(R_i(r_i) \mid P, h, H)$. $R_{m+1}(r_{m+1})$ is defined as STOP: the STOP symbol is added to the vocabulary of non-terminals, and the model stops generating right modifiers when it is generated.

3. Generate modifiers to the left of the head with probability $\prod_{i=1}^{n+1} \mathcal{P}_l(L_i(l_i) \mid P, h, H)$, where $L_{n+1}(l_{n+1})$ = STOP.

2 We find lexical heads in Penn treebank data using rules which are similar to those used by (Magerman 95; Jelinek et al. 94).

3 With the exception of the top rule in the tree, which has the form TOP → H(h).

4 An exception is the first rule in the tree, TOP → H(h), which has probability $\mathcal{P}_{TOP}(H, h \mid TOP)$.

For example, the probability of the rule S(bought) → NP(week) NP(Marks) VP(bought) would be estimated as

$$\mathcal{P}_h(VP \mid S, bought) \times \mathcal{P}_l(NP(Marks) \mid S, VP, bought) \times \mathcal{P}_l(NP(week) \mid S, VP, bought) \times \mathcal{P}_l(STOP \mid S, VP, bought) \times \mathcal{P}_r(STOP \mid S, VP, bought)$$
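The three-step decomposition can be sketched in Python as follows. This is only an illustration: the parameter tables `P_H`, `P_L`, `P_R`, their made-up values, and the function name `rule_probability` are assumptions, and the distance measure introduced later is not included.

```python
STOP = "STOP"

# Hypothetical toy parameter tables (values are made up for illustration).
# Keys follow the conditioning context: (generated item, parent, head label, head word).
P_H = {("VP", "S", "bought"): 0.7}
P_L = {
    ("NP(Marks)", "S", "VP", "bought"): 0.1,
    ("NP(week)", "S", "VP", "bought"): 0.05,
    (STOP, "S", "VP", "bought"): 0.5,
}
P_R = {(STOP, "S", "VP", "bought"): 0.8}


def rule_probability(parent, head, head_word, left_mods, right_mods):
    """Probability of one lexicalised rule under the three-step decomposition.

    left_mods and right_mods are listed from the head outwards.
    """
    prob = P_H[(head, parent, head_word)]        # step 1: head label
    for mod in list(right_mods) + [STOP]:        # step 2: right modifiers, then STOP
        prob *= P_R[(mod, parent, head, head_word)]
    for mod in list(left_mods) + [STOP]:         # step 3: left modifiers, then STOP
        prob *= P_L[(mod, parent, head, head_word)]
    return prob


# S(bought) -> NP(week) NP(Marks) VP(bought): NP(Marks) is the modifier closest to the head.
p = rule_probability("S", "VP", "bought", ["NP(Marks)", "NP(week)"], [])
print(p)
```

The call multiplies the same five factors as the worked example in the text: one head probability, two left-modifier probabilities and one STOP probability on each side.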

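We have made the 0th order Markov assumptions

$$\mathcal{P}_l(L_i(l_i) \mid H, P, h, L_1(l_1) \ldots L_{i-1}(l_{i-1})) = \mathcal{P}_l(L_i(l_i) \mid H, P, h) \qquad (4)$$

$$\mathcal{P}_r(R_i(r_i) \mid H, P, h, R_1(r_1) \ldots R_{i-1}(r_{i-1})) = \mathcal{P}_r(R_i(r_i) \mid H, P, h) \qquad (5)$$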

but in general the probabilities could be conditioned on any of the preceding modifiers. In fact, if the derivation order is fixed to be depth-first (that is, each modifier recursively generates the sub-tree below it before the next modifier is generated) then the model can also condition on any structure below the preceding modifiers. For the moment we exploit this by making the approximations

$$\mathcal{P}_l(L_i(l_i) \mid H, P, h, L_1(l_1) \ldots L_{i-1}(l_{i-1})) = \mathcal{P}_l(L_i(l_i) \mid H, P, h, distance_l(i-1)) \qquad (6)$$

$$\mathcal{P}_r(R_i(r_i) \mid H, P, h, R_1(r_1) \ldots R_{i-1}(r_{i-1})) = \mathcal{P}_r(R_i(r_i) \mid H, P, h, distance_r(i-1)) \qquad (7)$$

where $distance_l$ and $distance_r$ are functions of the surface string from the head word to the edge of the constituent (see figure 2). The distance measure is the same as in (Collins 96), a vector with the following 3 elements: (1) Is the string of zero length? (Allowing the model to learn a preference for right-branching structures.) (2) Does the string contain a verb? (Allowing the model to learn a preference for modification of the most recent verb.) (3) Does the string contain 0, 1, 2 or > 2 commas? (Where a comma is anything tagged as "," or ":".)
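A minimal Python sketch of the three-element distance vector is given below. The function name and the input format (a list of (word, POS tag) pairs for the surface string between the head word and the constituent edge) are assumptions. The paper does not say how a verb is detected; treating any Penn Treebank tag that starts with "VB" as a verb is an assumption of this sketch.

```python
def distance_features(tagged_words):
    """Return (is_zero_length, contains_verb, comma_bucket) for a surface string.

    tagged_words: list of (word, pos_tag) pairs between the head word and the
    edge of the constituent built so far.
    """
    is_zero_length = len(tagged_words) == 0
    # Assumption: a verb is any token whose tag starts with "VB".
    contains_verb = any(tag.startswith("VB") for _, tag in tagged_words)
    # A "comma" is anything tagged as "," or ":".
    commas = sum(1 for _, tag in tagged_words if tag in (",", ":"))
    comma_bucket = min(commas, 3)  # 0, 1, 2, or 3 standing for "more than 2"
    return (is_zero_length, contains_verb, comma_bucket)


print(distance_features([]))
print(distance_features([("Marks", "NNP"), (",", ","), ("bought", "VBD")]))
```

Capping the comma count at 3 keeps the feature to four values, matching the "0, 1, 2 or > 2" buckets described above.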



Figure 2: The next child, $R_3(r_3)$, is generated with probability $\mathcal{P}(R_3(r_3) \mid P, H, h, distance_r(2))$. The distance is a function of the surface string from the word after h to the last word of $R_2$, inclusive. In principle the model could condition on any structure dominated by H, $R_1$ or $R_2$.

2.2 Model 2: The complement/adjunct distinction and subcategorisation

The tree in figure 1 is an example of the importance of the complement/adjunct distinction. It would be useful to identify "Marks" as a subject, and "Last week" as an adjunct (temporal modifier), but this distinction is not made in the tree, as both NPs are in the same position^5 (sisters to a VP under an S node). From here on we will identify complements by attaching a "-C" suffix to non-terminals; figure 3 gives an example tree.




Figure 3: A tree with the "-C" suffix used to identify complements. "Marks" and "Brooks" are in subject and object position respectively. "Last week" is an adjunct.

A post-processing stage could add this detail to the parser output, but we give two reasons for making the distinction while parsing. First, identifying complements is complex enough to warrant a probabilistic treatment. Lexical information is needed (for example, knowledge that "week" is likely to be a temporal modifier). Knowledge about subcategorisation preferences (for example, that a verb takes exactly one subject) is also required. These problems are not restricted to NPs; compare "The spokeswoman said (SBAR that the asbestos was dangerous)" vs. "Bonds beat short-term investments (SBAR because the market is down)", where an SBAR headed by "that" is a complement, but an SBAR headed by "because" is an adjunct.

The second reason for making the complement/adjunct distinction while parsing is that it may help parsing accuracy. The assumption that complements are generated independently of each other often leads to incorrect parses; see figure 4 for further explanation.

2.2.1 Identifying Complements and Adjuncts in the Penn Treebank

We add the "-C" suffix to all non-terminals in training data which satisfy the following conditions:

1. The non-terminal must be: (1) an NP, SBAR, or S whose parent is an S; (2) an NP, SBAR, S, or VP whose parent is a VP; or (3) an S whose parent is an SBAR.

2. The non-terminal must not have one of the following semantic tags: ADV, VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. See (Marcus et al. 94) for an explanation of what these tags signify. For example, the NP "Last week" in figure 1 would have the TMP (temporal) tag; and the SBAR in "(SBAR because the market is down)" would have the ADV (adverbial) tag.

In addition, the first child following the head of a prepositional phrase is marked as a complement.

5 Except "Marks" is closer to the VP, but note that "Marks" is also the subject in "Marks last week bought Brooks".
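The two conditions above can be sketched as a small Python predicate. The function names and the way semantic tags are passed in (as a separate collection rather than parsed from a treebank label) are assumptions of this sketch, and the additional rule for prepositional phrases is deliberately left out.

```python
# Semantic tags that block complement status (condition 2).
NON_COMPLEMENT_TAGS = {"ADV", "VOC", "BNF", "DIR", "EXT", "LOC", "MNR", "TMP", "CLR", "PRP"}

# Condition 1: which labels may be complements under which parent.
ALLOWED_UNDER = {
    "S": {"NP", "SBAR", "S"},
    "VP": {"NP", "SBAR", "S", "VP"},
    "SBAR": {"S"},
}


def is_complement(label, parent, semantic_tags=()):
    """True if a non-terminal satisfies both conditions for the "-C" suffix."""
    if label not in ALLOWED_UNDER.get(parent, set()):
        return False
    return not (set(semantic_tags) & NON_COMPLEMENT_TAGS)


def mark(label, parent, semantic_tags=()):
    """Return the label with "-C" appended when it is identified as a complement."""
    return label + "-C" if is_complement(label, parent, semantic_tags) else label


print(mark("NP", "S"))             # e.g. the subject "Marks"
print(mark("NP", "S", ("TMP",)))   # e.g. the temporal NP "Last week"
```

Under these rules an NP under S with a TMP tag stays unmarked, while an untagged NP in the same position receives the suffix.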

2.2.2 Probabilities over Subcategorisation Frames

The model could be retrained on training data with the enhanced set of non-terminals, and it might learn the lexical properties which distinguish complements and adjuncts ("Marks" vs. "week", or "that" vs. "because"). However, it would still suffer from the bad independence assumptions illustrated in figure 4. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right subcategorisation frames:

1. Choose a head H with probability $\mathcal{P}_h(H \mid P, h)$.

2. Choose left and right subcat frames, LC and RC, with probabilities $\mathcal{P}_{lc}(LC \mid P, H, h)$ and $\mathcal{P}_{rc}(RC \mid P, H, h)$. Each subcat frame is a multiset^6 specifying the complements which the head requires in its left or right modifiers.

3. Generate the left and right modifiers with probabilities $\mathcal{P}_l(L_i, l_i \mid H, P, h, distance_l(i-1), LC)$ and $\mathcal{P}_r(R_i, r_i \mid H, P, h, distance_r(i-1), RC)$ respectively. Thus the subcat requirements are added to the conditioning context. As complements are generated they are removed from the appropriate subcat multiset. Most importantly, the probability of generating the STOP symbol will be 0 when the subcat frame is non-empty, and the probability of generating a complement will be 0 when it is not in the subcat frame; thus all and only the required complements will be generated.

Figure 4: Two examples where the assumption that modifiers are generated independently of each other leads to errors. In (1) the probability of generating both "Dreyfus" and "fund" as subjects, $\mathcal{P}(NP\text{-}C(Dreyfus) \mid S, VP, was) \times \mathcal{P}(NP\text{-}C(fund) \mid S, VP, was)$, is unreasonably high. (2) is similar: $\mathcal{P}(NP\text{-}C(bill), VP\text{-}C(funding) \mid VP, VB, was) = \mathcal{P}(NP\text{-}C(bill) \mid VP, VB, was) \times \mathcal{P}(VP\text{-}C(funding) \mid VP, VB, was)$ is a bad independence assumption.

The probability of the phrase S(bought) → NP(week) NP-C(Marks) VP(bought) is now:

$$\mathcal{P}_h(VP \mid S, bought) \times \mathcal{P}_{lc}(\{NP\text{-}C\} \mid S, VP, bought) \times \mathcal{P}_{rc}(\{\} \mid S, VP, bought) \times \mathcal{P}_l(NP\text{-}C(Marks) \mid S, VP, bought, \{NP\text{-}C\}) \times \mathcal{P}_l(NP(week) \mid S, VP, bought, \{\}) \times \mathcal{P}_l(STOP \mid S, VP, bought, \{\}) \times \mathcal{P}_r(STOP \mid S, VP, bought, \{\})$$

Here the head initially decides to take a single NP-C (subject) to its left, and no complements to its right. NP-C(Marks) is immediately generated as the required subject, and NP-C is removed from LC, leaving it empty when the next modifier, NP(week), is generated. The incorrect structures in figure 4 should now have low probability because $\mathcal{P}_{lc}(\{NP\text{-}C, NP\text{-}C\} \mid S, VP, bought)$ and $\mathcal{P}_{rc}(\{NP\text{-}C, VP\text{-}C\} \mid VP, VB, was)$ are small.

6 A multiset, or bag, is a set which may contain duplicate non-terminal labels.
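The way a subcat multiset constrains modifier generation on one side of the head can be sketched in Python with `collections.Counter`. The function `side_probability`, the callback `p_mod` and the toy probability values are assumptions for illustration; the head, parent and distance conditioning variables are omitted to keep the sketch short.

```python
from collections import Counter

STOP = "STOP"


def side_probability(modifiers, subcat, p_mod):
    """Probability of generating `modifiers` (head outwards) and then STOP on one side.

    subcat: Counter of required complement labels (the subcat multiset).
    p_mod(label, frame): hypothetical parameter lookup, where frame is the
    sorted tuple of complements still required.
    """
    remaining = Counter(subcat)  # copy, so the caller's frame is untouched
    prob = 1.0
    for label in list(modifiers) + [STOP]:
        frame = tuple(sorted(remaining.elements()))
        if label == STOP and frame:
            return 0.0  # STOP has probability 0 while complements are still required
        is_complement = label.endswith("-C")
        if is_complement and remaining[label] == 0:
            return 0.0  # a complement not in the subcat frame has probability 0
        prob *= p_mod(label, frame)
        if is_complement:
            remaining[label] -= 1  # the requirement is removed once fulfilled
    return prob


# Made-up parameter values for the left side of S(bought) with LC = {NP-C}.
TOY = {("NP-C", ("NP-C",)): 0.3, ("NP", ()): 0.05, (STOP, ()): 0.5}


def toy_p(label, frame):
    return TOY.get((label, frame), 0.01)


print(side_probability(["NP-C", "NP"], Counter({"NP-C": 1}), toy_p))
print(side_probability(["NP"], Counter({"NP-C": 1}), toy_p))
```

The second call returns 0.0 because STOP is reached while the NP-C requirement is still unfulfilled, which is how "all and only the required complements" are enforced.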

2.3 Model 3: Traces and Wh-Movement 

Another obstacle to extracting predicate-argument 
structure from parse trees is wh-movement. This 
section describes a probabilistic treatment of extrac- 
tion from relative clauses. Noun phrases are most of- 
ten extracted from subject position, object position, 
or from within PPs: 

Example 1 The store (SBAR which TRACE 
bought Brooks Brothers) 

Example 2 The store (SBAR which Marks bought 
TRACE) 

Example 3 The store (SBAR which Marks bought 
Brooks Brothers from TRACE) 

It might be possible to write rule-based patterns which identify traces in a parse tree. However, we argue again that this task is best integrated into the parser: the task is complex enough to warrant a probabilistic treatment, and integration may help parsing accuracy. A couple of complexities are that modification by an SBAR does not always involve extraction (e.g., "the fact (SBAR that besoboru is played with a ball and a bat)"), and it is not uncommon for extraction to occur through several constituents (e.g., "The changes (SBAR that he said the government was prepared to make TRACE)").

(1) NP          -> NP SBAR(+gap)
(2) SBAR(+gap)  -> WHNP S-C(+gap)
(3) S(+gap)     -> NP-C VP(+gap)
(4) VP(+gap)    -> VB TRACE NP

Figure 5: A +gap feature can be added to non-terminals to describe NP extraction. The top-level NP initially generates an SBAR modifier, but specifies that it must contain an NP trace by adding the +gap feature. The gap is then passed down through the tree, until it is discharged as a TRACE complement to the right of bought.

The second reason for an integrated treatment 
of traces is to improve the parameterisation of the 
model. In particular, the subcategorisation proba- 
bilities are smeared by extraction. In examples 1, 2 
and 3 above 'bought' is a transitive verb, but with- 
out knowledge of traces example 2 in training data 
will contribute to the probability of 'bought' being 
an intransitive verb. 



Formalisms similar to GPSG (Gazdar et al. 95) 
handle NP extraction by adding a gap feature to 
each non-terminal in the tree, and propagating gaps 
through the tree until they are finally discharged as a 
trace complement (see figure 5). In extraction cases
the Penn treebank annotation co-indexes a TRACE 
with the WHNP head of the SBAR, so it is straight- 
forward to add this information to trees in training 
data. 

Given that the LHS of the rule has a gap, there 
are 3 ways that the gap can be passed down to the 
RHS: 



Head The gap is passed to the head of the phrase, 
as in rule (3) in figure 5.

Left, Right The gap is passed on recursively to one 
of the left or right modifiers of the head, or is 
discharged as a trace argument to the left/right 
of the head. In rule (2) it is passed on to a right 
modifier, the S complement. In rule (4) a trace 
is generated to the right of the head VB. 



We specify a parameter $\mathcal{P}_G(G \mid P, h, H)$ where G is either Head, Left or Right. The generative process is extended to choose between these cases after generating the head of the phrase. The rest of the phrase is then generated in different ways depending on how the gap is propagated: In the Head case the left and right modifiers are generated as normal. In the Left, Right cases a gap requirement is added to either the left or right SUBCAT variable. This requirement is fulfilled (and removed from the subcat list) when a trace or a modifier non-terminal which has the +gap feature is generated. For example, Rule (2), SBAR(that)(+gap) → WHNP(that) S-C(bought)(+gap), has probability

$$\mathcal{P}_h(WHNP \mid SBAR, that) \times \mathcal{P}_G(Right \mid SBAR, WHNP, that) \times \mathcal{P}_{lc}(\{\} \mid SBAR, WHNP, that) \times \mathcal{P}_{rc}(\{S\text{-}C\} \mid SBAR, WHNP, that) \times \mathcal{P}_r(S\text{-}C(bought)(+gap) \mid SBAR, WHNP, that, \{S\text{-}C, +gap\}) \times \mathcal{P}_r(STOP \mid SBAR, WHNP, that, \{\}) \times \mathcal{P}_l(STOP \mid SBAR, WHNP, that, \{\})$$

Rule (4), VP(bought)(+gap) → VB(bought) TRACE NP(week), has probability

$$\mathcal{P}_h(VB \mid VP, bought) \times \mathcal{P}_G(Right \mid VP, bought, VB) \times \mathcal{P}_{lc}(\{\} \mid VP, bought, VB) \times \mathcal{P}_{rc}(\{NP\text{-}C\} \mid VP, bought, VB) \times \mathcal{P}_r(TRACE \mid VP, bought, VB, \{NP\text{-}C, +gap\}) \times \mathcal{P}_r(NP(week) \mid VP, bought, VB, \{\}) \times \mathcal{P}_l(STOP \mid VP, bought, VB, \{\}) \times \mathcal{P}_r(STOP \mid VP, bought, VB, \{\})$$

In rule (2) Right is chosen, so the +gap requirement is added to RC. Generation of S-C(bought)(+gap) fulfills both the S-C and +gap requirements in RC. In rule (4) Right is chosen again. Note that generation of trace satisfies both the NP-C and +gap subcat requirements.

Figure 6: The life of a constituent in the chart. (+) means a constituent is complete (i.e. it includes the stop probabilities), (-) means a constituent is incomplete. (a) A new constituent is started by projecting a complete rule upwards; (b) the constituent then takes left and right modifiers (or none if it is unary); (c) finally, STOP probabilities are added to complete the constituent.

Table 1: The conditioning variables for each level of back-off. For example, $\mathcal{P}_h$ estimation interpolates $e_1 = \mathcal{P}_h(H \mid P, w, t)$, $e_2 = \mathcal{P}_h(H \mid P, t)$, and $e_3 = \mathcal{P}_h(H \mid P)$. Δ is the distance measure.
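The bookkeeping for gap requirements in Model 3 can be sketched as an operation on the subcat multiset. The function `discharge` and its arguments are assumptions of this sketch, as is the choice to raise an error where the model would instead assign probability 0; the sketch covers only the NP-extraction case described in this section.

```python
from collections import Counter

GAP = "+gap"


def discharge(requirements, modifier, carries_gap=False):
    """Return the requirements left after generating one modifier.

    requirements: Counter over complement labels plus, possibly, "+gap".
    modifier: a non-terminal label such as "S-C" or "NP", or "TRACE".
    carries_gap: True when the modifier non-terminal has the +gap feature.
    """
    remaining = Counter(requirements)
    if modifier == "TRACE":
        needed = ["NP-C", GAP]  # a trace satisfies both the NP-C and +gap requirements
    else:
        needed = [modifier] if modifier.endswith("-C") else []
        if carries_gap:
            needed.append(GAP)
    for item in needed:
        if remaining[item] == 0:
            # In the model this event would simply have probability 0.
            raise ValueError(f"{modifier} is not licensed by {dict(requirements)}")
        remaining[item] -= 1
    return +remaining  # unary plus drops entries whose count has reached zero


# Rule (2): RC = {S-C, +gap}; generating S-C(bought)(+gap) empties it.
print(discharge(Counter({"S-C": 1, GAP: 1}), "S-C", carries_gap=True))
# Rule (4): RC = {NP-C, +gap}; generating TRACE empties it.
print(discharge(Counter({"NP-C": 1, GAP: 1}), "TRACE"))
```

In both calls the returned multiset is empty, so the following modifiers and STOP are generated with an empty frame, as in the two worked probabilities above.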

3 Practical Issues 

3.1 Smoothing and Unknown Words 

Table 1 shows the various levels of back-off for each type of parameter in the model. Note that we decompose $\mathcal{P}_l(L_i(lw_i, lt_i) \mid P, H, w, t, \Delta, LC)$ (where $lw_i$ and $lt_i$ are the word and POS tag generated with non-terminal $L_i$, and Δ is the distance measure) into the product $\mathcal{P}_{l1}(L_i(lt_i) \mid P, H, w, t, \Delta, LC) \times \mathcal{P}_{l2}(lw_i \mid L_i, lt_i, P, H, w, t, \Delta, LC)$, and then smooth these two probabilities separately (Jason Eisner, p.c.). In each case^7 the final estimate is

$$e = \lambda_1 e_1 + (1 - \lambda_1)(\lambda_2 e_2 + (1 - \lambda_2) e_3)$$

where $e_1$, $e_2$ and $e_3$ are maximum likelihood estimates with the context at levels 1, 2 and 3 in the table, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are smoothing parameters where $0 \le \lambda_i \le 1$. All words occurring less than 5 times in training data, and words in test data which have never been seen in training, are replaced with the "UNKNOWN" token. This allows the model to robustly handle the statistics for rare or new words.
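The interpolation formula and the rare-word replacement can be sketched in Python as below. The function names and the example values are assumptions; how the $\lambda$ values themselves are set is not described in this section, so they are simply passed in.

```python
from collections import Counter


def interpolate(estimates, lambdas):
    """Nested linear interpolation of back-off estimates.

    For three levels this computes e = l1*e1 + (1 - l1)*(l2*e2 + (1 - l2)*e3);
    the same recursion covers the four-level case.
    """
    assert len(lambdas) == len(estimates) - 1
    result = estimates[-1]  # start from the most general estimate
    for e, lam in zip(reversed(estimates[:-1]), reversed(lambdas)):
        result = lam * e + (1 - lam) * result
    return result


def replace_rare(sentences, min_count=5, unknown="UNKNOWN"):
    """Replace every word seen fewer than min_count times with the unknown token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w if counts[w] >= min_count else unknown for w in s] for s in sentences]


print(interpolate([0.5, 0.3, 0.1], [0.8, 0.6]))

corpus = [["the", "dog"]] * 5 + [["the", "cat"]]
print(replace_rare(corpus)[-1])
```

In the toy corpus "cat" occurs once and is replaced, while "dog" occurs exactly five times and is kept, following the "less than 5 times" threshold in the text.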

3.2 Part of Speech Tagging and Parsing 

Part of speech tags are generated along with the words in this model. When parsing, the POS tags allowed for each word are limited to those which have been seen in training data for that word. For unknown words, the output from the tagger described in (Ratnaparkhi 96) is used as the single possible tag for that word. A CKY style dynamic programming chart parser is used to find the maximum probability tree for each sentence (see figure 6).
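To illustrate the general idea of CKY-style dynamic programming, the sketch below finds the maximum probability tree for a plain PCFG in Chomsky normal form. It is a deliberately simplified stand-in, not the parser described here: it has no lexical heads, no complete/incomplete constituents and no STOP probabilities, and the grammar, probabilities and function name are assumptions.

```python
from collections import defaultdict


def cky_best(words, lexical, binary, root="S"):
    """Viterbi CKY for a binarised PCFG.

    lexical: {(tag, word): prob}; binary: {(parent, left, right): prob}.
    Returns (probability, tree) for the best `root` spanning the sentence, or None.
    """
    n = len(words)
    chart = defaultdict(dict)  # chart[(i, j)][label] = (prob, tree)
    for i, word in enumerate(words):
        for (tag, w), p in lexical.items():
            if w == word:
                chart[(i, i + 1)][tag] = (p, (tag, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (parent, left, right), p in binary.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        lp, ltree = chart[(i, k)][left]
                        rp, rtree = chart[(k, j)][right]
                        prob = p * lp * rp
                        # Keep only the highest-probability analysis per label and span.
                        if prob > chart[(i, j)].get(parent, (0.0, None))[0]:
                            chart[(i, j)][parent] = (prob, (parent, ltree, rtree))
    return chart[(0, n)].get(root)


# Hypothetical toy grammar with made-up probabilities.
LEXICAL = {("NP", "Marks"): 0.5, ("NP", "Brooks"): 0.5, ("V", "bought"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

print(cky_best(["Marks", "bought", "Brooks"], LEXICAL, BINARY))
```

Each chart cell stores only the best analysis for a label over a span, which is what lets the search return the maximum probability tree without enumerating every tree.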

4 Results 

The parser was trained on sections 02 - 21 of the Wall Street Journal portion of the Penn Treebank (Marcus et al. 93) (approximately 40,000 sentences), and tested on section 23 (2,416 sentences). We use the PARSEVAL measures (Black et al. 91) to compare performance:

Labeled Precision = (number of correct constituents in proposed parse) / (number of constituents in proposed parse)

Labeled Recall = (number of correct constituents in proposed parse) / (number of constituents in treebank parse)

Crossing Brackets = number of constituents which violate constituent boundaries with a constituent in the treebank parse.

7 Except cases L2 and R2, which have 4 levels, so that $e = \lambda_1 e_1 + (1 - \lambda_1)(\lambda_2 e_2 + (1 - \lambda_2)(\lambda_3 e_3 + (1 - \lambda_3) e_4))$.

                 ≤ 40 words (2245 sentences)              ≤ 100 words (2416 sentences)
MODEL            LR      LP      CBs    0 CBs   ≤ 2 CBs   LR      LP      CBs    0 CBs   ≤ 2 CBs
(Magerman 95)    84.6%   84.9%   1.26   56.6%   81.4%     84.0%   84.3%   1.46   54.0%   78.8%
(Collins 96)     85.8%   86.3%   1.14   59.9%   83.6%     85.3%   85.7%   1.32   57.2%   80.8%
Model 1          87.4%   88.1%   0.96   65.7%   86.3%     86.8%   87.6%   1.11   63.1%   84.1%
Model 2          88.1%   88.6%   0.91   66.5%   86.9%     87.5%   88.1%   1.07   63.9%   84.6%
Model 3          88.1%   88.6%   0.91   66.4%   86.9%     87.5%   88.1%   1.07   63.9%   84.6%

Table 2: Results on Section 23 of the WSJ Treebank. LR/LP = labeled recall/precision. CBs is the average number of crossing brackets per sentence. 0 CBs, ≤ 2 CBs are the percentage of sentences with 0 or ≤ 2 crossing brackets respectively.
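The three PARSEVAL measures can be sketched for a single sentence as follows. The representation of a parse as a set of (label, start, end) triples with an exclusive end index, and the function name, are assumptions; the sketch does not strip punctuation and, because it uses sets, ignores repeated constituents with the same label and span.

```python
def parseval(proposed, gold):
    """Return (labeled precision, labeled recall, crossing brackets) for one sentence.

    proposed, gold: sets of (label, start, end) constituents, end exclusive.
    """
    correct = len(proposed & gold)  # same span and same label
    precision = correct / len(proposed)
    recall = correct / len(gold)

    def crosses(a, b):
        # Two spans cross when they overlap but neither contains the other.
        return a[1] < b[1] < a[2] < b[2] or b[1] < a[1] < b[2] < a[2]

    crossing = sum(1 for p in proposed if any(crosses(p, g) for g in gold))
    return precision, recall, crossing


# Hypothetical four-word sentence: the proposed parse misplaces the VP boundary.
gold = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
proposed = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 3), ("NP", 3, 4)}
print(parseval(proposed, gold))
```

Here two of the four proposed constituents match the treebank parse, and the proposed VP over words 1 to 3 crosses the treebank NP over words 2 to 4.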

For a constituent to be 'correct' it must span the same set of words (ignoring punctuation, i.e. all tokens tagged as commas, colons or quotes) and have the same label^8 as a constituent in the treebank parse. Table 2 shows the results for Models 1, 2 and 3. The precision/recall of the traces found by Model 3 was 93.3%/90.1% (out of 436 cases in section 23 of the treebank), where three criteria must be met for a trace to be "correct": (1) it must be an argument to the correct head-word; (2) it must be in the correct position in relation to that head word (preceding or following); (3) it must be dominated by the correct non-terminal label. For example, in figure 5 the trace is an argument to bought, which it follows, and it is dominated by a VP. Of the 436 cases, 342 were string-vacuous extraction from subject position, recovered with 97.1%/98.2% precision/recall; and 94 were longer distance cases, recovered with 76%/60.6% precision/recall.^9

4.1 Comparison to previous work 



Model 1 is similar in structure to (Collins 96), the major differences being that the "score" for each bigram dependency is $\mathcal{P}_l(L_i, l_i \mid H, P, h, distance_l)$ rather than $\mathcal{P}_l(L_i, P, H \mid l_i, h, distance_l)$, and that there are the additional probabilities of generating the head and the STOP symbols for each constituent. However, Model 1 has some advantages which may account for the improved performance. The model in (Collins 96) is deficient, that is, for most sentences S, $\sum_T \mathcal{P}(T \mid S) < 1$, because probability mass is lost to dependency structures which violate the hard constraint that no links may cross. For reasons we do not have space to describe here, Model 1 has advantages in its treatment of unary rules and the distance measure. The generative model can condition on any structure that has been previously generated (we exploit this in Models 2 and 3), whereas (Collins 96) is restricted to conditioning on features of the surface string alone.

8 (Magerman 95) collapses ADVP and PRT to the same label; for comparison we also removed this distinction when calculating scores.

9 We exclude infinitival relative clauses from these figures, for example "I called a plumber TRACE to fix the sink" where 'plumber' is co-indexed with the trace subject of the infinitival. The algorithm scored 41%/18% precision/recall on the 60 cases in section 23, but infinitival relatives are extremely difficult even for human annotators to distinguish from purpose clauses (in this case, the infinitival could be a purpose clause modifying 'called') (Ann Taylor, p.c.).

(Charniak 95) also uses a lexicalised generative model. In our notation, he decomposes $\mathcal{P}(RHS_i \mid LHS_i)$ as $\mathcal{P}(R_n \ldots R_1 H L_1 \ldots L_m \mid P, h) \times \prod_{i=1 \ldots n} \mathcal{P}(r_i \mid P, R_i, h) \times \prod_{i=1 \ldots m} \mathcal{P}(l_i \mid P, L_i, h)$. The Penn treebank annotation style leads to a very large number of context-free rules, so that directly estimating $\mathcal{P}(R_n \ldots R_1 H L_1 \ldots L_m \mid P, h)$ may lead to sparse data problems, or problems with coverage (a rule which has never been seen in training may be required for a test data sentence). The complement/adjunct distinction and traces increase the number of rules, compounding this problem.

(Eisner 96) proposes 3 dependency models, and gives results that show that a generative model similar to Model 1 performs best of the three. However, a pure dependency model omits non-terminal information, which is important. For example, "hope" is likely to generate a VP(TO) modifier (e.g., I hope [VP to sleep]) whereas "require" is likely to generate an S(TO) modifier (e.g., I require [S Jim to sleep]), but omitting non-terminals conflates these two cases, giving high probability to incorrect structures such as "I hope [Jim to sleep]" or "I require [to sleep]". (Alshawi 96) extends a generative dependency model to include an additional state variable which is equivalent to having non-terminals; his suggestions may be close to our Models 1 and 2, but he does not fully specify the details of his model, and does not give results for parsing accuracy. (Miller et al. 96) describe a model where the RHS of a rule is generated by a Markov process, although the process is not head-centered. They increase the set of non-terminals by adding semantic labels rather than by adding lexical head-words.

(Magerman 95; Jelinek et al. 94) describe a history-based approach which uses decision trees to estimate $\mathcal{P}(T \mid S)$. Our models use much less sophisticated n-gram estimation methods, and might well benefit from methods such as decision-tree estimation which could condition on richer history than just surface distance.

There has recently been interest in using dependency-based parsing models in speech recognition, for example (Stolcke 96). It is interesting to note that Models 1, 2 or 3 could be used as language models. The probability for any sentence can be estimated as $\mathcal{P}(S) = \sum_T \mathcal{P}(T, S)$, or (making a Viterbi approximation for efficiency reasons) as $\mathcal{P}(S) \approx \mathcal{P}(T_{best}, S)$. We intend to perform experiments to compare the perplexity of the various models, and a structurally similar 'pure' PCFG.

5 Conclusions 

This paper has proposed a generative, lexicalised, 
probabilistic parsing model. We have shown that lin- 
guistically fundamental ideas, namely subcategori- 
sation and wh-movement, can be given a statistical 
interpretation. This improves parsing performance, 
and, more importantly, adds useful information to 
the parser's output.